{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "d19521dc",
   "metadata": {},
   "source": [
    "# RAGatouille\n",
    "\n",
    "[RAGatouille](https://github.com/bclavie/RAGatouille) makes it as simple as can be to use ColBERT! [ColBERT](https://github.com/stanford-futuredata/ColBERT) is a fast and accurate retrieval model, enabling scalable BERT-based search over large text collections in tens of milliseconds.\n",
    "\n",
    "There are multiple ways that we can use RAGatouille.\n",
    "\n",
    "\n",
    "## Setup\n",
    "\n",
    "The integration lives in the `ragatouille` package.\n",
    "\n",
    "```bash\n",
    "pip install -U ragatouille\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "00de63d0",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Jan 10, 10:53:28] Loading segmented_maxsim_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/harrisonchase/.pyenv/versions/3.10.1/envs/langchain/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py:125: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.\n",
      "  warnings.warn(\n"
     ]
    }
   ],
   "source": [
    "from ragatouille import RAGPretrainedModel\n",
    "\n",
    "RAG = RAGPretrainedModel.from_pretrained(\"colbert-ir/colbertv2.0\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "59d069ef",
   "metadata": {},
   "source": [
    "## Retriever\n",
    "\n",
    "We can use RAGatouille as a retriever. For more information on this, see the [RAGatouille Retriever](/docs/integrations/retrievers/ragatouille)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6407e18e",
   "metadata": {},
   "source": [
    "## Document Compressor\n",
    "\n",
    "We can also use RAGatouille off-the-shelf as a reranker. This will allow us to use ColBERT to rerank retrieved results from any generic retriever. The benefits of this are that we can do this on top of any existing index, so that we don't need to create a new idex. We can do this by using the [document compressor](/docs/modules/data_connection/retrievers/contextual_compression) abstraction in LangChain."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "16d16022",
   "metadata": {},
   "source": [
    "## Setup Vanilla Retriever\n",
    "\n",
    "First, let's set up a vanilla retriever as an example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "6ee6af64",
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "from langchain_community.vectorstores import FAISS\n",
    "from langchain_openai import OpenAIEmbeddings\n",
    "from langchain_text_splitters import RecursiveCharacterTextSplitter\n",
    "\n",
    "\n",
    "def get_wikipedia_page(title: str):\n",
    "    \"\"\"\n",
    "    Retrieve the full text content of a Wikipedia page.\n",
    "\n",
    "    :param title: str - Title of the Wikipedia page.\n",
    "    :return: str - Full text content of the page as raw string.\n",
    "    \"\"\"\n",
    "    # Wikipedia API endpoint\n",
    "    URL = \"https://en.wikipedia.org/w/api.php\"\n",
    "\n",
    "    # Parameters for the API request\n",
    "    params = {\n",
    "        \"action\": \"query\",\n",
    "        \"format\": \"json\",\n",
    "        \"titles\": title,\n",
    "        \"prop\": \"extracts\",\n",
    "        \"explaintext\": True,\n",
    "    }\n",
    "\n",
    "    # Custom User-Agent header to comply with Wikipedia's best practices\n",
    "    headers = {\"User-Agent\": \"RAGatouille_tutorial/0.0.1 (ben@clavie.eu)\"}\n",
    "\n",
    "    response = requests.get(URL, params=params, headers=headers)\n",
    "    data = response.json()\n",
    "\n",
    "    # Extracting page content\n",
    "    page = next(iter(data[\"query\"][\"pages\"].values()))\n",
    "    return page[\"extract\"] if \"extract\" in page else None\n",
    "\n",
    "\n",
    "text = get_wikipedia_page(\"Hayao_Miyazaki\")\n",
    "text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)\n",
    "texts = text_splitter.create_documents([text])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "22b9dbf7",
   "metadata": {},
   "outputs": [],
   "source": [
    "retriever = FAISS.from_documents(texts, OpenAIEmbeddings()).as_retriever(\n",
    "    search_kwargs={\"k\": 10}\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "50f54a7d",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Document(page_content='collaborative projects. In April 1984, Miyazaki opened his own office in Suginami Ward, naming it Nibariki.')"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "docs = retriever.invoke(\"What animation studio did Miyazaki found\")\n",
    "docs[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ef72bb50",
   "metadata": {},
   "source": [
    "We can see that the result isn't super relevant to the question asked"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6c09c803",
   "metadata": {},
   "source": [
    "## Using ColBERT as a reranker"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "9653b742",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/harrisonchase/.pyenv/versions/3.10.1/envs/langchain/lib/python3.10/site-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling\n",
      "  warnings.warn(\n"
     ]
    }
   ],
   "source": [
    "from langchain.retrievers import ContextualCompressionRetriever\n",
    "\n",
    "compression_retriever = ContextualCompressionRetriever(\n",
    "    base_compressor=RAG.as_langchain_document_compressor(), base_retriever=retriever\n",
    ")\n",
    "\n",
    "compressed_docs = compression_retriever.get_relevant_documents(\n",
    "    \"What animation studio did Miyazaki found\"\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "35aaceee",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Document(page_content='In June 1985, Miyazaki, Takahata, Tokuma and Suzuki founded the animation production company Studio Ghibli, with funding from Tokuma Shoten. Studio Ghibli\\'s first film, Laputa: Castle in the Sky (1986), employed the same production crew of Nausicaä. Miyazaki\\'s designs for the film\\'s setting were inspired by Greek architecture and \"European urbanistic templates\". Some of the architecture in the film was also inspired by a Welsh mining town; Miyazaki witnessed the mining strike upon his first', metadata={'relevance_score': 26.5194149017334})"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "compressed_docs[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6c2bbefc",
   "metadata": {},
   "source": [
    "This answer is much more relevant!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3746b734",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
